Task #2a: Load the Telcom Customer Churn dataset into a DataFrame.
Task #2b: Confirm the import by executing the head method on your DataFrame.
Task #2c: Determine the ‘shape’ of your DataFrame. Then list the fields with their datatypes using the info method.
# 2a
import pandas as pd
churn = pd.read_csv('churn.csv', delimiter=',', thousands=',')
# 2b
churn.head()
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
5 rows × 21 columns
# 2c
print(churn.shape)
churn.info() #checking number of null values
(7043, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
churn.dropna(inplace = True) # eliminate null values
churn.info() # same row count as before: no rows were dropped, so no true null values exist... interesting
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.2+ MB
Task #3a: For customers with a missing Churn status (neither Yes nor No), fill in the values to be ‘No’.
#3a
churn.Churn.value_counts() # checking for missing values: 5174 + 1869 = 7043 matches the row count, so no Churn values are missing
No     5174
Yes    1869
Name: Churn, dtype: int64
churn.Churn = churn.Churn.apply(lambda x: 'Yes' if x == 'Yes' else 'No') # recode anything that is not 'Yes' (including any missing value) to 'No'
churn.Churn.value_counts() # counts unchanged, as expected
No     5174
Yes    1869
Name: Churn, dtype: int64
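The lambda recode above works; for the record, `fillna` is the direct pandas tool when the missing values are true NaN. A minimal sketch on made-up rows (the toy frame stands in for the real churn data):

```python
import pandas as pd

# Toy frame standing in for the churn data (values are made up)
df = pd.DataFrame({'Churn': ['Yes', 'No', None, 'No']})

# fillna targets only true missing values, leaving existing 'Yes'/'No' untouched
df['Churn'] = df['Churn'].fillna('No')
```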
Task #3b: Drop any rows with more than one missing value.
#3b Dropped any N/A's
churn.dropna(inplace = True)
churn.customerID.value_counts() #still no change
3005-NFMTA 1
7823-JSOAG 1
9544-PYPSJ 1
6896-SRVYQ 1
5999-LCXAO 1
..
8639-NHQEI 1
3769-MHZNV 1
5840-NVDCG 1
4343-EJVQB 1
0020-JDNXP 1
Name: customerID, Length: 7043, dtype: int64
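Note that a bare `dropna()` drops rows with *any* missing value, while task 3b asks to drop rows with *more than one*. The `thresh` argument expresses that directly: keep rows with at least (number of columns - 1) non-null values. A sketch on made-up rows:

```python
import pandas as pd
import numpy as np

# Toy frame: row 0 is complete, row 1 has three missing values, row 2 has one
df = pd.DataFrame({'a': [1, np.nan, np.nan],
                   'b': [2, np.nan, 5],
                   'c': [3, np.nan, 6]})

# Keep rows with at most one missing value, i.e. at least (ncols - 1) non-nulls
df = df.dropna(thresh=len(df.columns) - 1)
```

Rows 0 and 2 survive; only the row with two or more missing values is dropped.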
Task #3c: Eliminate any duplicate records.
#3c no duplicate customer ID's but for good measure, showing any duplicates:
churn.customerID[churn.customerID.duplicated()]
# row count matches 7043
Series([], Name: customerID, dtype: object)
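`duplicated()` came back empty, so there is nothing to drop here; for completeness, the idiomatic call for task 3c is `drop_duplicates`. A sketch on made-up rows (the IDs are invented):

```python
import pandas as pd

# Toy frame with one fully duplicated record (IDs are made up)
df = pd.DataFrame({'customerID': ['0001-AAAA', '0001-AAAA', '0002-BBBB'],
                   'Churn': ['No', 'No', 'Yes']})

# Keep the first occurrence of each fully duplicated record
df = df.drop_duplicates()
```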
churn.columns
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
dtype='object')
churn.InternetService.value_counts() # totals 7043
Fiber optic    3096
DSL            2421
No             1526
Name: InternetService, dtype: int64
churn.OnlineBackup.value_counts()
No                     3088
Yes                    2429
No internet service    1526
Name: OnlineBackup, dtype: int64
churn.OnlineSecurity.value_counts()
No                     3498
Yes                    2019
No internet service    1526
Name: OnlineSecurity, dtype: int64
Task #3d: Create a new column within the DataFrame called ‘Techsavvy’. Rows within this column will have a value of either high, medium, or low. High Techsavvy customers use both online security and online backup and have a Fiber Optic internet service. Medium Techsavvy customers also use Fiber Optic internet service and online security, but not online backup. All other customers are low Techsavvy. As we cannot really tell the level of tech-savvyness for customers who do not use the internet service (i.e., are ‘no’), do not give them any ‘Techsavvy’ rating – instead, put a placeholder value of ‘unknown’ in the Techsavvy column. Note: exclude customers with an ‘unknown’ Techsavvy value from any visualizations (see below).
# 3d creating new column
def tech_savvy(x):
    if x.InternetService == 'Fiber optic':
        if x.OnlineSecurity == 'Yes' and x.OnlineBackup == 'Yes':
            return 'High'
        elif x.OnlineSecurity == 'Yes' and x.OnlineBackup == 'No':
            return 'Medium'
        else:
            return 'Low'
    elif x.InternetService == 'DSL':
        return 'Low'
    elif x.InternetService == 'No':
        return 'Unknown'

churn['Techsavvy'] = churn.apply(tech_savvy, axis=1)
churn.Techsavvy.value_counts() #7043 rows...all rows accounted for
Low        4678
Unknown    1526
High        489
Medium      350
Name: Techsavvy, dtype: int64
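The row-wise `apply` above is readable but slow on large frames; the same Techsavvy logic can be vectorized with `numpy.select`. A sketch on a toy frame (the data is made up; the labels match the notebook):

```python
import pandas as pd
import numpy as np

# Toy rows covering the four Techsavvy outcomes (values are made up)
df = pd.DataFrame({
    'InternetService': ['Fiber optic', 'Fiber optic', 'DSL', 'No'],
    'OnlineSecurity':  ['Yes', 'Yes', 'No', 'No internet service'],
    'OnlineBackup':    ['Yes', 'No', 'No', 'No internet service'],
})

fiber = df.InternetService == 'Fiber optic'
conditions = [
    df.InternetService == 'No',                                          # no internet: unknown
    fiber & (df.OnlineSecurity == 'Yes') & (df.OnlineBackup == 'Yes'),   # high
    fiber & (df.OnlineSecurity == 'Yes') & (df.OnlineBackup == 'No'),    # medium
]
# First matching condition wins; everything else falls through to 'Low'
df['Techsavvy'] = np.select(conditions, ['Unknown', 'High', 'Medium'], default='Low')
```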
Task #3e: Create a new column within the DataFrame called ‘Streamer’. Rows within this column will have a value of ‘streamer’ if they subscribe to both StreamingTV and StreamingMovies. Otherwise, a customer will have a streamer value of non-streamer. Again, preclude any customers that do not subscribe to any internet service – give these customers the value of ‘unknown’. Note: exclude customers with an ‘unknown’ Streamer’ value from any visualizations (see below).
# 3e creating a new column
def streamer(x):
    if x.InternetService == 'No':
        return 'Unknown'
    elif x.StreamingMovies == 'Yes' and x.StreamingTV == 'Yes':
        return 'Streamer'
    else:
        return 'Non-streamer'

churn['Streamer'] = churn.apply(streamer, axis=1)
churn.Streamer.value_counts()
Non-streamer    3577
Streamer        1940
Unknown         1526
Name: Streamer, dtype: int64
Task #3f: Replace all occurrences of ‘No internet service’ with ‘NA’.
#3f replacing all No in internet service with NA
churn.InternetService = churn.InternetService.map({'Fiber optic': 'Fiber optic', 'DSL': 'DSL', 'No': 'NA'})
churn.InternetService.value_counts()
Fiber optic    3096
DSL            2421
NA             1526
Name: InternetService, dtype: int64
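The `map` above must restate every category it keeps; `Series.replace` touches only the stated value. And since the literal ‘No internet service’ also appears in columns like OnlineSecurity, a DataFrame-wide replace is another reading of task 3f. A sketch on made-up rows:

```python
import pandas as pd

# Toy rows (values are made up)
df = pd.DataFrame({
    'InternetService': ['DSL', 'No', 'Fiber optic'],
    'OnlineBackup':    ['Yes', 'No internet service', 'No'],
})

# Column-level: recode only InternetService, leaving other columns' 'No' values alone
df['InternetService'] = df['InternetService'].replace('No', 'NA')

# Frame-wide: replace the literal 'No internet service' everywhere it occurs
df = df.replace('No internet service', 'NA')
```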
Task #3g: Recode ‘Bank transfer (automatic)’ and ‘Credit card (automatic)’ to be ‘Automatic’
churn.PaymentMethod.value_counts() # 3g checking for all values
Electronic check             2365
Mailed check                 1612
Bank transfer (automatic)    1544
Credit card (automatic)      1522
Name: PaymentMethod, dtype: int64
# mapping both automatic payment methods to 'Automatic'
churn.PaymentMethod = churn.PaymentMethod.map({'Electronic check': 'Electronic check', 'Mailed check': 'Mailed check', 'Bank transfer (automatic)' : 'Automatic', 'Credit card (automatic)': 'Automatic'})
churn.PaymentMethod.value_counts() # Automatic does equal total of both previous credit card and bank transfer
Automatic           3066
Electronic check    2365
Mailed check        1612
Name: PaymentMethod, dtype: int64
Task #3h: Create an explicit row index for your DataFrame using the customerID field. Name this index CustomerID.
#3h setting index
churn.set_index('customerID', inplace = True)
churn.head()
| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | ... | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | Techsavvy | Streamer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| customerID | |||||||||||||||||||||
| 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | ... | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | Low | Non-streamer |
| 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | ... | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No | Low | Non-streamer |
| 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | ... | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | Low | Non-streamer |
| 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | ... | No | No | One year | No | Automatic | 42.30 | 1840.75 | No | Low | Non-streamer |
| 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | ... | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes | Low | Non-streamer |
5 rows × 22 columns
Task #3i: Identify any potential outliers in the column MonthlyCharges. Plot the values to help identify any outliers. If you find potential outliers, change the monthly charge to match the ceiling amount.
churn.MonthlyCharges #3i
customerID
7590-VHVEG 29.85
5575-GNVDE 56.95
3668-QPYBK 53.85
7795-CFOCW 42.30
9237-HQITU 70.70
...
6840-RESVB 84.80
2234-XADUH 103.20
4801-JZAZL 29.60
8361-LTMKD 74.40
3186-AJIEK 105.65
Name: MonthlyCharges, Length: 7043, dtype: float64
#3i plotting the distribution: no outliers identified
import seaborn as sns
import numpy as np
import plotly.express as px
sns.histplot(churn.MonthlyCharges, kde=True) # distplot is deprecated in recent seaborn versions
<matplotlib.axes._subplots.AxesSubplot at 0x1f312091348>
fig = px.histogram(churn, x = churn.MonthlyCharges, nbins = 50)
fig.show()
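The plots above suggest no extreme MonthlyCharges values. Had any appeared, one common option (not the only one) for "change the monthly charge to match the ceiling amount" is a Tukey IQR fence combined with `Series.clip`. A sketch on made-up numbers with one planted outlier:

```python
import pandas as pd

# Made-up charges; 500.0 is a planted outlier for illustration
charges = pd.Series([29.85, 56.95, 70.70, 89.10, 105.65, 500.0])

q1, q3 = charges.quantile([0.25, 0.75])
iqr = q3 - q1
ceiling = q3 + 1.5 * iqr  # standard Tukey upper fence

# Cap anything above the fence at the ceiling value
charges = charges.clip(upper=ceiling)
```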
Task #3j: Perform a cross-check that tenure multiplied by MonthlyCharges equals TotalCharges. If the cross-check fails to be within a 10% range of the recorded TotalCharges, then drop the entire record.
# converting all columns to number data types
#3j
churn.tenure = churn.tenure.astype(int)
churn.MonthlyCharges = churn.MonthlyCharges.astype(float)
# to_numeric raises at position 488 ---- BAD DATA: blank (" ") strings in TotalCharges
churn.TotalCharges = pd.to_numeric(churn.TotalCharges)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas\_libs\lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string " "

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-27-40c8745cf0ef> in <module>
      1 # error at position 488 ---- BAD DATA ---drop rows with " " for TotalCharges
      2
----> 3 churn.TotalCharges = pd.to_numeric(churn.TotalCharges)

~\Anaconda3\lib\site-packages\pandas\core\tools\numeric.py in to_numeric(arg, errors, downcast)
    149     coerce_numeric = errors not in ("ignore", "raise")
    150     values = lib.maybe_convert_numeric(
--> 151         values, set(), coerce_numeric=coerce_numeric
    152     )
    153
pandas\_libs\lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string " " at position 488
#identifying number of rows with blank TotalCharges
churn.TotalCharges[churn.TotalCharges != " "] # row count 7032 which means there are 11 missing values
customerID
7590-VHVEG 29.85
5575-GNVDE 1889.5
3668-QPYBK 108.15
7795-CFOCW 1840.75
9237-HQITU 151.65
...
6840-RESVB 1990.5
2234-XADUH 7362.9
4801-JZAZL 346.45
8361-LTMKD 306.6
3186-AJIEK 6844.5
Name: TotalCharges, Length: 7032, dtype: object
churn.TotalCharges.dropna(inplace = True) # ineffective: the blanks are " " strings, not NaN, so there is nothing for dropna to find
# row number matches
churn.shape # showing dropna is not working
(7043, 22)
# recode all " " to 0 so the column can be converted to numeric
churn.loc[churn.TotalCharges == " ", 'TotalCharges'] = 0 # .loc avoids chained-assignment warnings
churn.TotalCharges[churn.TotalCharges != " "] # row count changed to 7043 showing that the 11 missing values have been recoded to 0
customerID
7590-VHVEG 29.85
5575-GNVDE 1889.5
3668-QPYBK 108.15
7795-CFOCW 1840.75
9237-HQITU 151.65
...
6840-RESVB 1990.5
2234-XADUH 7362.9
4801-JZAZL 346.45
8361-LTMKD 306.6
3186-AJIEK 6844.5
Name: TotalCharges, Length: 7043, dtype: object
churn.TotalCharges = pd.to_numeric(churn.TotalCharges)
# Create a cross_check column: rows get the string 'NaN' if TotalCharges is 0,
# or if tenure * MonthlyCharges falls outside a 10% range of TotalCharges.
# Rows marked 'NaN' are removed below.
def Cross_Check(x):
    if x.TotalCharges == 0:
        return 'NaN'
    else:
        a = (x.tenure * x.MonthlyCharges) / x.TotalCharges
        if 0.9 < a < 1.10:
            return a
        else:
            return 'NaN'
churn['cross_check'] = churn.apply(Cross_Check, axis = 1)
churn.cross_check.value_counts() # 7043 total: 614 rows at exactly 1.0, 397 'NaN' placeholders
1.0 614
NaN 397
0.9391304347826087 2
1.0419681620839363 2
1.054054054054054 2
...
0.9957003211596848 1
1.007497776083365 1
0.9602308291223303 1
1.0321002751452157 1
0.9528394995865455 1
Name: cross_check, Length: 6031, dtype: int64
#drop all rows not within 10% range
churn.drop(churn[churn['cross_check'] == 'NaN'].index, inplace = True)
# rows check out 7043 - 397 (NaN) = 6646
churn.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6646 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 23 columns):
gender              6646 non-null object
SeniorCitizen       6646 non-null int64
Partner             6646 non-null object
Dependents          6646 non-null object
tenure              6646 non-null int32
PhoneService        6646 non-null object
MultipleLines       6646 non-null object
InternetService     6646 non-null object
OnlineSecurity      6646 non-null object
OnlineBackup        6646 non-null object
DeviceProtection    6646 non-null object
TechSupport         6646 non-null object
StreamingTV         6646 non-null object
StreamingMovies     6646 non-null object
Contract            6646 non-null object
PaperlessBilling    6646 non-null object
PaymentMethod       6646 non-null object
MonthlyCharges      6646 non-null float64
TotalCharges        6646 non-null float64
Churn               6646 non-null object
Techsavvy           6646 non-null object
Streamer            6646 non-null object
cross_check         6646 non-null object
dtypes: float64(2), int32(1), int64(1), object(19)
memory usage: 1.2+ MB
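For comparison, `pd.to_numeric(errors='coerce')` plus a boolean mask performs the whole 3j cross-check in a few vectorized lines, using real NaN instead of a zero placeholder and string sentinel. A sketch on made-up rows:

```python
import pandas as pd

# Toy rows (values are made up); ' ' mimics the blank strings in the raw file
df = pd.DataFrame({
    'tenure':         [10,    2,    5],
    'MonthlyCharges': [50.0,  30.0, 20.0],
    'TotalCharges':   ['500', ' ',  '250'],
})

# Coerce unparseable blanks to NaN instead of raising
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Keep only rows whose tenure * MonthlyCharges lands within 10% of TotalCharges;
# NaN ratios compare as False, so blank-TotalCharges rows are dropped too
ratio = (df['tenure'] * df['MonthlyCharges']) / df['TotalCharges']
df = df[ratio.between(0.9, 1.1)]
```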
Task #4a: Categorize the ‘tenure’ field across four intervals. Ensure the interval cutoffs provide a balanced number (or close to that) of customers within each category and choose appropriate brief labels for each category. (Hint: you can use cut/qcut). Generate a visualization to display the distribution of customers across the categories to confirm the categorization has been performed correctly. Be careful to preserve the ‘tenure’ field, as it will be needed for our Logistic Regression analysis. Place the categorized tenure field (name this field: ‘TenureCat’) immediately to the right of the tenure field within the DataFrame.
#4a creating qcut of tenure into equalish sized bins
tenure_labels = ['Infant', 'Child', 'Teen', 'Adult'] # avoid shadowing the tenure column name
churn['TenureCat'] = pd.qcut(churn.tenure, 4, labels=tenure_labels)
churn.TenureCat.value_counts()# total of rows = 6646
Teen      1714
Infant    1679
Child     1646
Adult     1607
Name: TenureCat, dtype: int64
# reordering columns
churn.columns
Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
'MonthlyCharges', 'TotalCharges', 'Churn', 'Techsavvy', 'Streamer',
'cross_check', 'TenureCat'],
dtype='object')
churn = churn[['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure','TenureCat',
'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
'MonthlyCharges', 'TotalCharges', 'Churn', 'Techsavvy', 'Streamer',
'cross_check']]
churn.columns
Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
'TenureCat', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn', 'Techsavvy',
'Streamer', 'cross_check'],
dtype='object')
churn.head()
| | gender | SeniorCitizen | Partner | Dependents | tenure | TenureCat | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | Techsavvy | Streamer | cross_check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| customerID | |||||||||||||||||||||
| 7590-VHVEG | Female | 0 | Yes | No | 1 | Infant | No | No phone service | DSL | No | ... | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | Low | Non-streamer | 1 |
| 5575-GNVDE | Male | 0 | No | No | 34 | Teen | Yes | No | DSL | Yes | ... | No | One year | No | Mailed check | 56.95 | 1889.50 | No | Low | Non-streamer | 1.02477 |
| 3668-QPYBK | Male | 0 | No | No | 2 | Infant | Yes | No | DSL | Yes | ... | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | Low | Non-streamer | 0.995839 |
| 7795-CFOCW | Male | 0 | No | No | 45 | Teen | No | No phone service | DSL | Yes | ... | No | One year | No | Automatic | 42.30 | 1840.75 | No | Low | Non-streamer | 1.03409 |
| 9237-HQITU | Female | 0 | No | No | 2 | Infant | Yes | No | Fiber optic | No | ... | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes | Low | Non-streamer | 0.93241 |
5 rows × 24 columns
import plotly.express as px
fig = px.histogram(churn, x = churn.TenureCat, color = churn.TenureCat)
fig.show()
Task #4b: For each of the tenure categories, provide a count of customers across the different contract types (one year, two year, etc.). Also report the average monthly charge across the different contract types. Plot this data in an appropriate visualization to promote visual comparison across the tenure categories.
#4b for each TenureCat provide a count of customers across the different contract types
px.histogram(churn, x = churn.TenureCat, color = churn.Contract, barmode = 'group')
#4b Continued: showing average MonthlyCharges by Contract and TenureCat
u = churn.groupby(['Contract', 'TenureCat'])['MonthlyCharges'].mean()
u = u.unstack()
u
| TenureCat | Infant | Child | Teen | Adult |
|---|---|---|---|---|
| Contract | ||||
| Month-to-month | 59.187241 | 70.297854 | 79.303598 | 89.678846 |
| One year | 37.193443 | 51.640086 | 68.231575 | 83.620627 |
| Two year | 34.600000 | 34.773684 | 49.694125 | 70.551685 |
#graph to compare monthly charges across TenureCat by contract type
import plotly.graph_objects as go
import plotly.figure_factory as ff
table_data = u
fig = ff.create_table(table_data, height_constant=50)
Contract_type = u.columns
U1 = u.iloc[0]
U2 = u.iloc[1]
U3 = u.iloc[2]
trace1 = go.Bar(x=Contract_type, y=U1, xaxis='x2', yaxis='y2',
marker=dict(color='lightblue'),
name='Month<br>to<br>Month')
trace2 = go.Bar(x=Contract_type, y=U2, xaxis='x2', yaxis='y2',
marker=dict(color='salmon'),
name='One<br>Year')
trace3 = go.Bar(x=Contract_type, y=U3, xaxis='x2', yaxis='y2',
marker=dict(color='indianred'),
name='Two<br>Year')
fig.add_traces([trace1, trace2, trace3])
fig['layout']['xaxis2'] = {}
fig['layout']['yaxis2'] = {}
fig.layout.yaxis.update({'domain': [0, .20]})
fig.layout.yaxis2.update({'domain': [.40, 1]})
fig.layout.yaxis2.update({'anchor': 'x2'})
fig.layout.xaxis2.update({'anchor': 'y2'})
fig.layout.yaxis2.update({'title': 'Monthly Charge in Dollars'})
fig.layout.margin.update({'t':75, 'l':50})
fig.layout.update({'title': 'Breakdown of Tenure Monthly Charge'})
fig.layout.update({'height':800})
fig.show()
#7a showing the count of each churn status
px.histogram(churn, x = churn.Churn, color = churn.Churn)
Task #7b: Generate a barplot comparing MonthlyCharges against InternetService.
#7b Generate a barplot comparing MonthlyCharges against InternetService
from plotly.subplots import make_subplots
# groupby to get the appropriate data
sum_IS = churn.groupby('InternetService')['MonthlyCharges'].sum()
mean_IS = churn.groupby('InternetService')['MonthlyCharges'].mean()
fig = make_subplots(rows=1, cols=2)
fig.add_bar(x = sum_IS.index, y=sum_IS,
marker=dict(color="MediumPurple"), text = sum_IS, textposition ='outside',
name="Total Monthly Charges", row=1, col=1)
fig.add_bar(x = mean_IS.index, y=mean_IS,
marker=dict(color="LightSeaGreen"),
name="Average Monthly Charges", text = mean_IS, textposition = 'outside', row=1, col=2) # label with the means, not the sums
fig.layout.title.text = "Monthly Charges by Internet Service"
fig.show()
Task #7c: Generate a violinplot comparing Churn status against InternetService.
#7c Violinplot of mean MonthlyCharges by InternetService, split by Churn status
mean_IS = churn.groupby(['InternetService', 'Churn'])['MonthlyCharges'].mean()
mean_IS = mean_IS.unstack(1)
y_1 = mean_IS.Yes
y_2 = mean_IS.No
fig = go.Figure()
fig.add_trace(go.Violin(
y=y_1,
name='Churn = Yes',
box_visible=True,
meanline_visible=True))
fig.add_trace(go.Violin(
y=y_2,
name='Churn = No',
box_visible=True,
meanline_visible=True))
fig.layout.title.text = "Monthly Charges by Churn Status and InternetService"
fig.show()
mean_IS
| Churn | No | Yes |
|---|---|---|
| InternetService | ||
| DSL | 60.698384 | 49.984444 |
| Fiber optic | 94.128866 | 88.312179 |
| NA | 21.196634 | 20.398515 |
Task #7d: Generate a series of plots comparing churn status against tenure, MonthlyCharges, TotalCharges, Techsavvy, SeniorCitizen, TenureCat, and Streamer.
Based on the visualizations performed so far, provide at least 3 preliminary observations – for example, do these visualizations provide any indicators of possible good predictors of Churn status? Please record your observations within the Notebook.
#7d Generate a series of plots (Churn vs TotalCharges, Techsavvy, MonthlyCharges, SeniorCitizen, TenureCat, Streamer)
import plotly.graph_objects as go
ch_totalcharge = churn.groupby('Churn')['TotalCharges'].mean()
px.bar(x = ch_totalcharge.index, y = ch_totalcharge)
x = ch_totalcharge.index
y = ch_totalcharge.values # positional lookups like ch_totalcharge[0] are deprecated on labeled indexes
fig = go.Figure(data=[go.Bar(
x=x, y=y,
text=y,
marker_color = ['lightblue', 'salmon'],
textposition='auto',
)])
fig.update_layout(title_text='Churn Status vs Average Total Charge')
fig.show()
ch_techsavvy = churn.groupby(['Techsavvy','Churn'])['Churn'].count()
ch_techsavvy = ch_techsavvy.unstack(1)
x = ch_techsavvy.index
y_1 = ch_techsavvy.No
y_2 = ch_techsavvy.Yes
fig = go.Figure()
fig.add_trace(go.Bar(
x=x, y=y_1,
text=y_1,
name = 'No Churn',
marker_color = 'lightblue',
textposition='auto',
))
fig.add_trace(go.Bar(
x=x, y=y_2,
text=y_2,
name = 'Yes Churn',
marker_color = 'salmon',
textposition='auto',
))
fig.update_layout(title_text='Churn Status vs TechSavvy Status')
fig.show()
ch_monthlycharges = churn.groupby('Churn')['MonthlyCharges'].sum()
x = ch_monthlycharges.index
y = ch_monthlycharges.values
fig = go.Figure(data=[go.Bar(
x=x, y=y,
text=y,
marker_color = ['lightblue', 'salmon'],
textposition='auto',
)])
fig.update_layout(title_text='Churn Status vs Total Monthly Charges')
fig.show()
# binning MonthlyCharges into 20 quantile buckets makes the lack of a clear churn pattern easier to see
churn['MonthlyCat'] = pd.qcut(churn['MonthlyCharges'], q=20, labels=[str(i) for i in range(1, 21)])
px.histogram(churn, x = churn.MonthlyCat, color = churn.Churn, barmode = 'group')
churn.drop('MonthlyCat', axis = 1, inplace = True)
churn.SeniorCitizen = churn.SeniorCitizen.apply(lambda x: 'No' if x == 0 else 'Yes')
ch_seniorcitizen = churn.groupby('SeniorCitizen')['Churn'].count()
x = ch_seniorcitizen.index
y = ch_seniorcitizen.values
fig = go.Figure(data=[go.Bar(
x=x, y=y,
text=y,
marker_color = ['lightblue', 'salmon'],
textposition='auto',
)])
fig.update_layout(title_text='Churn Status vs Senior Citizen')
fig.show()
ch_tenurecat = churn.groupby(['TenureCat', 'Churn'])['Churn'].count()
ch_tenurecat = ch_tenurecat.unstack(1)
x = ch_tenurecat.index
y_1 = ch_tenurecat.No
y_2 = ch_tenurecat.Yes
fig = go.Figure()
fig.add_trace(go.Bar(
x=x, y=y_1,
text=y_1,
name = 'No Churn',
marker_color = 'lightblue',
textposition='auto',
))
fig.add_trace(go.Bar(
x=x, y=y_2,
text=y_2,
name = 'Yes Churn',
marker_color = 'salmon',
textposition='auto',
))
fig.update_layout(title_text='Churn Status vs TenureCat Status')
fig.show()
ch_streamer = churn.groupby(['Streamer', 'Churn'])['Churn'].count()
ch_streamer = ch_streamer.unstack(1)
x = ch_streamer.index
y_1 = ch_streamer.No
y_2 = ch_streamer.Yes
fig = go.Figure()
fig.add_trace(go.Bar(
x=x, y=y_1,
text=y_1,
name = 'No Churn',
marker_color = 'lightblue',
textposition='auto',
))
fig.add_trace(go.Bar(
x=x, y=y_2,
text=y_2,
name = 'Yes Churn',
marker_color = 'salmon',
textposition='auto',
))
fig.update_layout(title_text='Churn Status vs Streamer Status')
fig.show()
Task #7e: Calculate categorical means (i.e., means for each category) for MonthlyCharges across the InternetService, TenureCat’ and Techsavvy categories. Also, calculate categorical means for Tenure across the InternetService, Techsavvy and SeniorCitizen categories. Hint: use groupby. Plot the two most interesting results.
#7e categorical means MonthlyCharges: InternetService, TenureCat, TechSavvy. Tenure: InternetService, TenureCat, SeniorCitizen
mc_internetservice = churn.groupby('InternetService')['MonthlyCharges'].mean()
x = mc_internetservice.index
y = mc_internetservice
fig = go.Figure(data=[go.Bar(
x=x, y=y,
text=y,
marker_color = ['lightblue', 'salmon', 'indianred', '#330C73'],
textposition='auto',
)])
fig.update_layout(title_text='MonthlyCharges vs InternetService')
fig.show()
churn.InternetService
customerID
7590-VHVEG DSL
5575-GNVDE DSL
3668-QPYBK DSL
7795-CFOCW DSL
9237-HQITU Fiber optic
...
6840-RESVB DSL
2234-XADUH Fiber optic
4801-JZAZL DSL
8361-LTMKD Fiber optic
3186-AJIEK Fiber optic
Name: InternetService, Length: 6646, dtype: object
mc_tenurecat = churn.groupby('TenureCat')['MonthlyCharges'].mean()
x = mc_tenurecat.index
y = mc_tenurecat
fig = go.Figure(data=[go.Bar(
x=x, y=y,
text=y,
marker_color = ['lightblue', 'salmon', 'indianred', '#330C73'],
textposition='auto',
)])
fig.update_layout(title_text='MonthlyCharges vs Tenure Category')
fig.show()
# note: plotly can reorder categories, e.g. fig.update_layout(xaxis={'categoryorder': 'array', 'categoryarray': ['Low', 'Medium', 'High']})
mc_techsavvy = churn.groupby('Techsavvy')['MonthlyCharges'].mean()
x = mc_techsavvy.index
y = mc_techsavvy
fig = go.Figure(data=[go.Bar(
x=x, y=y,
text=y,
marker_color = ['lightblue', 'salmon', 'indianred', '#330C73'],
textposition='auto',
)])
fig.update_layout(title_text='MonthlyCharges vs Techsavvy')
fig.update_layout(xaxis={'categoryorder':'mean descending'})
fig.show()
tn_internetservice = churn.groupby('InternetService')['tenure'].mean()
x = tn_internetservice.index
y = tn_internetservice
fig = go.Figure(data=[go.Bar(
x=x, y=y,
text=y,
marker_color = ['lightblue', 'salmon', 'indianred'],
textposition='auto',
)])
fig.update_layout(title_text='Average Tenure vs InternetService')
fig.show()
tn_tenureCat = churn.groupby('TenureCat')['tenure'].mean()
x = tn_tenureCat.index
y = tn_tenureCat
fig = go.Figure(data=[go.Bar(
x=x, y=y,
text=y,
marker_color = ['lightblue', 'salmon', 'indianred', '#330C73'],
textposition='auto',
)])
fig.update_layout(title_text='Average Tenure by Tenure Category')
fig.show()
tn_seniorcitizen = churn.groupby('SeniorCitizen')['tenure'].mean()
x = tn_seniorcitizen.index
y = tn_seniorcitizen
fig = go.Figure(data=[go.Bar(
x=x, y=y,
text=y,
marker_color = ['lightblue', 'salmon'],
textposition='auto',
)])
fig.update_layout(title_text='Average Tenure by Senior Citizen Status')
fig.show()
Task #7f: Generate a single kdeplot plot comparing the distributions of MonthlyCharges and (TotalCharges divided by tenure).
#7f Generate a single kdeplot plot comparing the distributions of MonthlyCharges and (TotalCharges divided by tenure)
import plotly.figure_factory as ff
a = churn.MonthlyCharges
b = churn.TotalCharges/churn.tenure
colorscale = ['#7A4579', '#D56073', 'rgb(236,158,105)', (1, 1, 0.2), (0.98,0.98,0.98)]
fig = ff.create_2d_density(
a, b, colorscale=colorscale,
hist_color='rgb(255, 237, 222)', point_size=3
)
fig.layout.update({'height':800})
fig.layout.update({'width':1000})
fig.update_layout(title_text='Monthly Charges vs TotalCharges/Tenure')
fig.show()
# the same comparison rendered as a seaborn kdeplot
import matplotlib.pyplot as plt
plt.rc("font", size=14)
import seaborn as sns
plt.figure(figsize=(15,8))
ax = sns.kdeplot(a, color="lightblue", shade=True)
sns.kdeplot(b, color="salmon", shade=True)
plt.legend(['Monthly Charges', 'Total Charges over Tenure'])
plt.title('Density Plot')
ax.set(xlabel='Monthly_Charges')
plt.xlim(-10,120)
plt.show()
Task #7g: Generate a kdeplot comparing tenure and monthly charges. As these variables are clearly on different scales, you will need to first normalize the values in both variables (use the min-max normalization approach) and also remove any distorting outliers.
#7g Density plot comparing MonthlyCharges vs Tenure
import sklearn.preprocessing
a = churn.MonthlyCharges
b = churn.tenure
x = sklearn.preprocessing.minmax_scale(a, feature_range=(0, 1), axis=0, copy=True)
y = sklearn.preprocessing.minmax_scale(b, feature_range=(0, 1), axis=0, copy=True)
colorscale = ['#7A4579', '#D56073', 'rgb(236,158,105)', (1, 1, 0.2), (0.98,0.98,0.98)]
fig = ff.create_2d_density(
x, y, colorscale=colorscale,
hist_color='rgb(255, 237, 222)', point_size=3
)
fig.layout.update({'height':800})
fig.layout.update({'width':1000})
fig.update_layout(title_text='Monthly Charges vs Tenure')
fig.show()
import matplotlib.pyplot as plt
plt.rc("font", size=14)
import seaborn as sns
plt.figure(figsize=(15,8))
ax = sns.kdeplot(x, color="lightblue", shade=True)
sns.kdeplot(y, color="salmon", shade=True)
plt.legend(['Monthly Charges', 'Tenure']) # legend order matches the plotting order above
plt.title('Density Plot')
ax.set(xlabel='Normalized value (min-max)')
plt.xlim(-.25,1.25)
plt.show()
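The sklearn helper works fine; for reference, the min-max formula it applies can be written directly in pandas. A sketch on made-up values:

```python
import pandas as pd

# Made-up values standing in for a charges or tenure column
s = pd.Series([2.0, 4.0, 6.0, 10.0])

# Min-max normalization: (x - min) / (max - min) maps the series onto [0, 1]
scaled = (s - s.min()) / (s.max() - s.min())
```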
Task #7h: Generate a plot investigating whether the dataset has a gender bias.
#7h countplot of male vs female
import plotly.express as px
px.histogram(churn, x = churn.gender, color = churn.gender)
Task #7i: Generate a series of box plots showing MonthlyCharges against these fields: Techsavvy, Streamer, SeniorCitizen, and InternetService.
px.box(churn, x='Techsavvy', y='MonthlyCharges', color = 'Techsavvy')
px.box(churn, x='Streamer', y='MonthlyCharges', color = 'Streamer')
px.box(churn, x='SeniorCitizen', y='MonthlyCharges', color='SeniorCitizen')
px.box(churn, x='InternetService', y='MonthlyCharges', color='InternetService')
For our intended analytical method – binary logistic regression – the independent variables should be independent of each other: that is, the model should have little or no multicollinearity. We therefore exclude the following fields from further analysis: tenure, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies, as we have created other fields based on them. Moreover, we expect TotalCharges to be highly correlated with tenure and MonthlyCharges – accordingly, we also exclude TotalCharges from our model.
For brevity of analysis, we now move forward to the modeling stage, dropping the fields above and restricting the DataFrame to the remaining columns:
# 8 Creating a new DataFrame without the dropped fields
drop_list = ['tenure', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'TotalCharges', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'PaperlessBilling', 'cross_check', 'Contract', 'gender']
churn_new = churn.drop(drop_list, axis=1)
churn_new.to_csv('churn_new.csv')
Task #9a: For logistic regression, categorical variables need to be dummied (converted to dummy variables). As explanation, you cannot directly feed labelled data (e.g., low, medium, high) into the model: it has no notion of the ordering and cannot compare such values across variables. To address this, we need to recode the categorical variables. Pandas has a specific capability for this kind of recoding, the get_dummies method. Applying it produces an indicator matrix (DataFrame), which then needs to be merged / concatenated back with the main DataFrame. Also, as several categorical fields feature more than two categories, you will need to create additional fields within the DataFrame to accommodate them.
Task #9b: Check to ensure all categorical columns contain 1/0 (True/False) values. You can do this through visual inspection, using head.
#9a and 9b
churn_dummies = pd.get_dummies(churn_new)
churn_dummies = churn_dummies.drop(['Churn_No', 'SeniorCitizen_No'], axis = 1)
churn_dummies.head()
| MonthlyCharges | SeniorCitizen_Yes | TenureCat_Infant | TenureCat_Child | TenureCat_Teen | TenureCat_Adult | PaymentMethod_Automatic | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Churn_Yes | Techsavvy_High | Techsavvy_Low | Techsavvy_Medium | Techsavvy_Unknown | Streamer_Non-streamer | Streamer_Streamer | Streamer_Unknown | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| customerID | |||||||||||||||||
| 7590-VHVEG | 29.85 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 5575-GNVDE | 56.95 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3668-QPYBK | 53.85 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 7795-CFOCW | 42.30 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 9237-HQITU | 70.70 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
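Dropping `Churn_No` and `SeniorCitizen_No` by hand works, but `get_dummies(..., drop_first=True)` removes one reference level per categorical column automatically, which also avoids the dummy-variable trap that surfaces later as infinite VIFs. A minimal sketch on toy data (column names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'Contract': ['Month-to-month', 'One year', 'Two year'],
                    'Churn': ['No', 'Yes', 'No']})

full = pd.get_dummies(toy)                      # one dummy column per level
reduced = pd.get_dummies(toy, drop_first=True)  # drops the first level of each column

print(full.columns.tolist())     # 5 dummy columns
print(reduced.columns.tolist())  # 3 dummy columns; Churn_No is gone
```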
# Note: astype returns a new DataFrame, so the result below is display-only;
# assign it back to churn_dummies if the dtype change should persist
churn_dummies.loc[:, churn_dummies.columns != 'MonthlyCharges'].astype('category')
| SeniorCitizen_Yes | TenureCat_Infant | TenureCat_Child | TenureCat_Teen | TenureCat_Adult | PaymentMethod_Automatic | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Churn_Yes | Techsavvy_High | Techsavvy_Low | Techsavvy_Medium | Techsavvy_Unknown | Streamer_Non-streamer | Streamer_Streamer | Streamer_Unknown | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| customerID | ||||||||||||||||
| 7590-VHVEG | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 5575-GNVDE | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3668-QPYBK | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 7795-CFOCW | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 9237-HQITU | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6840-RESVB | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2234-XADUH | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4801-JZAZL | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 8361-LTMKD | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3186-AJIEK | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
6646 rows × 16 columns
Task #9c: With the dataset now fully prepped for modeling, perform a binary logistic regression analysis. Use the scikit-learn package: specifically, the sklearn.linear_model.LogisticRegression class. Given the technical nature of the official documentation, you may find a blog post demonstrating application of this method of some assistance (for example, see below). Note: as the focus of this project is not on mastering methods, it is not necessary to have the application of this method reflect all best practices aimed at model evaluation (e.g., recursive feature elimination, analysis of ROC, and confusion matrix analysis).
https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8
https://scikit-learn.org/stable/index.html
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
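Before the statsmodels-based diagnostics that follow, the basic sklearn workflow the task asks for looks roughly like this; the data here is synthetic (in the project, X would be churn_dummies minus the Churn_Yes column and y the Churn_Yes column):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
# the outcome depends mostly on the first feature, plus noise
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(round(model.score(X_test, y_test), 2))  # held-out accuracy, well above 0.5
```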
#9c
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Features: every column except the target dummy; target: Churn_Yes
X = churn_dummies.drop('Churn_Yes', axis=1)
y = churn_dummies['Churn_Yes']
# Checking variance inflation factors (values around 10 or above flag multicollinearity)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
C:\Users\Peter\Anaconda3\lib\site-packages\statsmodels\stats\outliers_influence.py:185: RuntimeWarning: divide by zero encountered in double_scalars
vif
| VIF Factor | features | |
|---|---|---|
| 0 | 4.460551 | MonthlyCharges |
| 1 | 1.067002 | SeniorCitizen_Yes |
| 2 | inf | TenureCat_Infant |
| 3 | inf | TenureCat_Child |
| 4 | inf | TenureCat_Teen |
| 5 | inf | TenureCat_Adult |
| 6 | inf | PaymentMethod_Automatic |
| 7 | inf | PaymentMethod_Electronic check |
| 8 | inf | PaymentMethod_Mailed check |
| 9 | inf | Techsavvy_High |
| 10 | inf | Techsavvy_Low |
| 11 | inf | Techsavvy_Medium |
| 12 | inf | Techsavvy_Unknown |
| 13 | inf | Streamer_Non-streamer |
| 14 | inf | Streamer_Streamer |
| 15 | inf | Streamer_Unknown |
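The infinite VIFs above are the dummy-variable trap: because every level of each categorical was kept, each group of dummies sums to exactly 1, so every dummy is an exact linear combination of the others. A small sketch on synthetic dummies (the group names `a`/`b` are illustrative):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
g1 = pd.get_dummies(pd.Series(rng.integers(0, 3, 300)), prefix='a', dtype=float)
g2 = pd.get_dummies(pd.Series(rng.integers(0, 2, 300)), prefix='b', dtype=float)

# Keeping every level of both groups: each group's dummies sum to 1, so the
# columns are exactly collinear and the VIFs blow up (with the same
# divide-by-zero RuntimeWarning seen above)
X_full = pd.concat([g1, g2], axis=1)
vif_full = [variance_inflation_factor(X_full.values, i) for i in range(X_full.shape[1])]

# Dropping one reference level per group removes the exact dependence
X_red = X_full.drop(['a_0', 'b_0'], axis=1)
vif_red = [variance_inflation_factor(X_red.values, i) for i in range(X_red.shape[1])]
print(max(vif_full), max(vif_red))  # huge/inf vs. modest single-digit values
```

This is exactly why the next step drops one reference category from each dummy group.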
# Drop one reference level per dummy group to avoid the dummy-variable trap
X_2 = X.drop(['Techsavvy_Medium', 'Streamer_Unknown', 'TenureCat_Adult', 'PaymentMethod_Mailed check'], axis=1)
import statsmodels.api as sm
logit_model = sm.Logit(y, X_2)
result = logit_model.fit()
print(result.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: Churn_Yes No. Observations: 6646
Model: Logit Df Residuals: 6634
Method: MLE Df Model: 11
Date: Mon, 09 Dec 2019 Pseudo R-squ.: inf
Time: 13:55:46 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
MonthlyCharges 0.0242 0.002 10.899 0.000 0.020 0.029
SeniorCitizen_Yes 0.5572 0.082 6.819 0.000 0.397 0.717
TenureCat_Infant 3.1145 0.130 23.919 0.000 2.859 3.370
TenureCat_Child 1.8066 0.123 14.661 0.000 1.565 2.048
TenureCat_Teen 1.1113 0.121 9.162 0.000 0.874 1.349
PaymentMethod_Automatic 0.0081 0.100 0.081 0.935 -0.187 0.203
PaymentMethod_Electronic check 0.6595 0.095 6.938 0.000 0.473 0.846
Techsavvy_High -0.3240 0.190 -1.710 0.087 -0.695 0.047
Techsavvy_Low 0.2735 0.140 1.956 0.050 -0.001 0.548
Techsavvy_Unknown -5.1331 0.175 -29.357 0.000 -5.476 -4.790
Streamer_Non-streamer -5.0534 0.264 -19.115 0.000 -5.572 -4.535
Streamer_Streamer -4.9674 0.292 -17.001 0.000 -5.540 -4.395
==================================================================================================
C:\Users\Peter\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py:1789: RuntimeWarning: divide by zero encountered in log
C:\Users\Peter\Anaconda3\lib\site-packages\statsmodels\base\model.py:492: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_2.values, i) for i in range(X_2.shape[1])]
vif["features"] = X_2.columns
vif
| VIF Factor | features | |
|---|---|---|
| 0 | 4.460551 | MonthlyCharges |
| 1 | 1.067002 | SeniorCitizen_Yes |
| 2 | 2.041885 | TenureCat_Infant |
| 3 | 1.770848 | TenureCat_Child |
| 4 | 1.626470 | TenureCat_Teen |
| 5 | 2.024099 | PaymentMethod_Automatic |
| 6 | 2.055850 | PaymentMethod_Electronic check |
| 7 | 2.328867 | Techsavvy_High |
| 8 | 4.972511 | Techsavvy_Low |
| 9 | 2.912105 | Techsavvy_Unknown |
| 10 | 32.820043 | Streamer_Non-streamer |
| 11 | 23.691759 | Streamer_Streamer |
# Dropping Streamer_Non-streamer for multicollinearity (VIF well above 10)
# Dropping PaymentMethod_Automatic because its p-value > .05
X_3 = X_2.drop(['Streamer_Non-streamer', 'PaymentMethod_Automatic'], axis=1)
X_3
| MonthlyCharges | SeniorCitizen_Yes | TenureCat_Infant | TenureCat_Child | TenureCat_Teen | PaymentMethod_Electronic check | Techsavvy_High | Techsavvy_Low | Techsavvy_Unknown | Streamer_Streamer | |
|---|---|---|---|---|---|---|---|---|---|---|
| customerID | ||||||||||
| 7590-VHVEG | 29.85 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 5575-GNVDE | 56.95 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3668-QPYBK | 53.85 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7795-CFOCW | 42.30 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 9237-HQITU | 70.70 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6840-RESVB | 84.80 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2234-XADUH | 103.20 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4801-JZAZL | 29.60 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 8361-LTMKD | 74.40 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3186-AJIEK | 105.65 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
6646 rows × 10 columns
logit_model = sm.Logit(y, X_3)
result = logit_model.fit()
print(result.summary())
Optimization terminated successfully.
Current function value: inf
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: Churn_Yes No. Observations: 6646
Model: Logit Df Residuals: 6636
Method: MLE Df Model: 9
Date: Mon, 09 Dec 2019 Pseudo R-squ.: inf
Time: 13:55:51 Log-Likelihood: -inf
converged: True LL-Null: 0.0000
Covariance Type: nonrobust LLR p-value: 1.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
MonthlyCharges -0.0097 0.001 -7.359 0.000 -0.012 -0.007
SeniorCitizen_Yes 0.5746 0.079 7.261 0.000 0.419 0.730
TenureCat_Infant 2.0356 0.099 20.469 0.000 1.841 2.230
TenureCat_Child 0.8691 0.098 8.878 0.000 0.677 1.061
TenureCat_Teen 0.2951 0.099 2.970 0.003 0.100 0.490
PaymentMethod_Electronic check 0.7297 0.066 11.039 0.000 0.600 0.859
Techsavvy_High -1.6253 0.167 -9.748 0.000 -1.952 -1.298
Techsavvy_Low -1.5728 0.096 -16.412 0.000 -1.761 -1.385
Techsavvy_Unknown -3.4955 0.130 -26.830 0.000 -3.751 -3.240
Streamer_Streamer 0.4232 0.084 5.052 0.000 0.259 0.587
==================================================================================================
C:\Users\Peter\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py:1789: RuntimeWarning: divide by zero encountered in log
C:\Users\Peter\Anaconda3\lib\site-packages\statsmodels\base\model.py:492: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
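Logit coefficients are easier to interpret as odds ratios: exponentiating a coefficient gives the multiplicative change in churn odds per unit increase. Using three coefficients from the final summary above (with a fitted statsmodels result, `np.exp(result.params)` would do this for all of them at once):

```python
import numpy as np

# exp(coefficient) = odds ratio: e.g. being a senior citizen multiplies the
# estimated churn odds by about 1.78, all else held equal
coefs = {'SeniorCitizen_Yes': 0.5746, 'TenureCat_Infant': 2.0356, 'Techsavvy_Unknown': -3.4955}
odds = {name: round(float(np.exp(b)), 2) for name, b in coefs.items()}
print(odds)  # {'SeniorCitizen_Yes': 1.78, 'TenureCat_Infant': 7.66, 'Techsavvy_Unknown': 0.03}
```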